Last updated: 2020-02-21

Overview

  • Why go Bayesian?
  • Introduction to Bayesian analysis
    • Bayesian estimation (getting the posterior distribution)
    • MCMC sampling
    • How good are my samples?
    • Posterior predictive checks
    • Model comparison
    • Summarizing the posterior

Why go Bayesian?

Why go Bayesian?

It tells you what you want to know!

Given the data (and our prior knowledge),

  • What interval contains an unobserved parameter with .95 (or some other) probability?
  • What's the relative weight of evidence for one model (e.g., 'null') vs. another (e.g., an alternative)?

Why go Bayesian?

  • We almost always have some prior knowledge of reasonable parameter values - this should be incorporated
  • Results are not dependent on a 'sampling plan' (see, e.g., Kruschke & Liddell, 2018 for more)
  • There are incredibly flexible packages that are fairly easy to use (e.g., brms)

Introduction to Bayesian analysis

Purpose

  • To re-allocate credibility over parameter values based on the observed data (Kruschke, 2015)
  • Given the observed data, what parameter values should we most strongly believe in?
  • To obtain this we need to start with a model and some prior expectation as to the probability of parameter values in this model (more on this later)

Bayesian estimation

  • \(\theta\) = parameter(s), \(y\) = observed data
  • \(p(y \mid \theta)\) = the likelihood
  • \(p(\theta)\) = the prior
  • \(p(y)\) = probability of the data ("the evidence")
  • \(p(\theta \mid y)\) = the posterior distribution (how much do we believe in different parameter values after seeing the data?)

\[ p(\theta \mid y) = \frac{p(\theta)p(y \mid \theta)}{p(y)} \]

Bayesian estimation

\(p(y)\) does not depend on the model parameters so we can omit it in favor of the unnormalized posterior

\[ p(\theta \mid y) \propto p(\theta)p(y \mid \theta) \]

The posterior is proportional to the prior times the likelihood

Likelihood

\(p(y \mid \theta)\)

Likelihood

  • To talk about the likelihood we'll use some fake data that we'll assume is normally distributed. Think of it as time (in seconds) to read a short passage of text measured for 30 individuals
  • Goal: to estimate mean reading time (\(\mu\)) and variability (\(\sigma\))

Likelihood

\[ L(\theta \mid \mathbf{y}) = \prod_{i=1}^{n} f(y_i \mid \theta) \]

  • For this example we can assume the observations are independent and \(f\) is the normal pdf
  • \(L(\theta \mid y) \propto p(y \mid \theta)\)
  • For an introduction, see Etz (2018)

Likelihood

In R:

# likelihood of data for mu = 5, and sd = 10
prod(dnorm(x = Y, mean = 5, sd = 10)) # or exp(sum(dnorm(Y, 5, 10, log = TRUE)))
## [1] 1.820996e-79
  • If we were doing maximum likelihood estimation we could use optim to search for the parameters that maximize the above (or minimize the negative log likelihood)
  • For Bayesian estimation we use the likelihood to update our prior beliefs in different parameter values
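To make the optim route concrete, here is a minimal MLE sketch (the data Y are simulated stand-ins for the 30 reading times; starting values are arbitrary):

```r
# Minimize the negative log likelihood to get ML estimates of mu and sigma
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 10)              # stand-in reading times

negll <- function(par)                          # par = c(mu, sigma)
  -sum(dnorm(Y, mean = par[1], sd = par[2], log = TRUE))

fit <- optim(par = c(30, 5), fn = negll, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6))             # keep sigma positive
fit$par   # close to mean(Y) and the (biased) ML estimate of the sd
```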

Likelihood

Prior

\(p(\theta)\)

Prior

  • What are our expectations for parameter values before seeing the data?
  • We will try to use "weakly informative priors" - one possible definition below

"Weakly informative prior should contain enough information to regularize: the idea is that the prior rules out unreasonable parameter values but is not so strong as to rule out values that might make sense"

Prior

Going back to our reading time example, say we have a fairly good idea of the reading rate (words per second) of the general population:

  • For the passage we gave participants we expect an average time of 40 s, but it could range from 20 to 60 s
  • Here we might choose a Normal(mean = 40, sd = 10) prior for \(\mu\)

Prior

For the standard deviation of reading times (\(\sigma\)), we might be less certain:

  • We expect people to vary on average around 10s, but we can't rule out larger values
  • A reasonable choice would be a half-Cauchy(scale = 10) prior on \(\sigma\)
  • Other possible choices: gamma, uniform, \(t\), see Gelman (2006) for discussion
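One quick sanity check on these choices is to draw from the priors and inspect the implied ranges (a base-R sketch):

```r
# Sample from the priors for mu and sigma and check their quantiles
set.seed(1)
mu_prior <- rnorm(1e5, mean = 40, sd = 10)                 # Normal(40, 10)
sigma_prior <- abs(rcauchy(1e5, location = 0, scale = 10)) # half-Cauchy(10)

quantile(mu_prior, c(0.025, 0.975))   # roughly 20 to 60, as intended
median(sigma_prior)                   # around 10, with a heavy right tail
```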

Prior

We can specify the priors separately

Prior

But they determine a joint distribution

Posterior is proportional to prior \(\times\) likelihood
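A concrete sketch of this relationship for the reading-time example, using a grid over \(\mu\) with \(\sigma\) fixed at 10 for simplicity (the data are simulated stand-ins):

```r
# Unnormalized posterior for mu on a grid: prior times likelihood
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 10)                 # stand-in reading times
mu_grid <- seq(10, 70, length.out = 601)

prior <- dnorm(mu_grid, mean = 40, sd = 10)        # p(theta)
lik <- sapply(mu_grid, function(m)
  exp(sum(dnorm(Y, mean = m, sd = 10, log = TRUE))))  # p(y | theta)

unnorm_post <- prior * lik                         # proportional to posterior
post <- unnorm_post / sum(unnorm_post)             # normalize over the grid
mu_grid[which.max(post)]                           # posterior mode for mu
```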

More on priors

Another example, linear regression:

\[ y_i \sim \mbox{Normal}(\beta_0 + \beta_1x_i, \; \sigma) \]

Prior

If \(x\) and \(y\) have been scaled (\(z\)-scored), a reasonable choice would be:

  • \(\beta_0 \sim \mbox{Normal}(0, 3)\)
  • \(\beta_1 \sim \mbox{Normal}(0, 3)\)
  • \(\sigma \sim \mbox{half-Cauchy}(2.5)\)
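In brms these priors could be specified as in the sketch below (hypothetical setup: `d` is assumed to be a data frame with \(z\)-scored columns `y` and `x`, and the brms package must be installed):

```r
library(brms)

# Priors for the scaled regression; sigma already has a lower bound of 0
# in brms, so cauchy(0, 2.5) acts as a half-Cauchy
priors <- c(
  set_prior("normal(0, 3)", class = "Intercept"),
  set_prior("normal(0, 3)", class = "b"),
  set_prior("cauchy(0, 2.5)", class = "sigma")
)

# fit <- brm(y ~ x, data = d, prior = priors)
```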

Prior

These priors essentially say that we expect either a positive or negative relationship between \(x\) and \(y\).

If we had strong reason to expect that \(y\) should increase with \(x\) we could instead use:

  • \(\beta_1 \sim \mbox{Normal}^{+}(0, 3)\) (the "+" means only positive values; the same as a half, or folded, normal)

Prior - correlation

  • For linear mixed-effects (lme) models with correlated random effects, we'll need a prior for the correlation matrix (in other work you might see people put a prior on the covariance matrix)
  • brms and Stan use the LKJ prior (after Lewandowski, Kurowicka, & Joe, 2009)
  • It has one parameter, \(\eta\) (shape)
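For a single correlation (the \(2 \times 2\) case) the LKJ density is proportional to \((1 - r^2)^{\eta - 1}\), which is easy to inspect directly:

```r
# LKJ prior shape for one correlation r (2 x 2 correlation matrix)
lkj <- function(r, eta) (1 - r^2)^(eta - 1)   # up to a normalizing constant

r <- c(0, 0.5, 0.9)
lkj(r, 1)   # eta = 1: uniform over correlations -> 1.00 1.00 1.00
lkj(r, 2)   # eta = 2: mild pull toward 0        -> 1.00 0.75 0.19
```

Larger \(\eta\) concentrates more mass near zero correlation; \(\eta < 1\) favors values of \(|r|\) near 1.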

Prior - correlation

An aside on lme4 convergence

  • For models with lots of random effects lme4 convergence can be an issue
  • This usually isn't an lme4 problem - the model is too complex to be supported by the data
  • With the additional regularizing information in the prior, this should not be an issue in brms/Stan
  • So it should be possible to 'keep it maximal'
  • Some still suggest that random effects structure should be simplified on grounds of parsimony (e.g., remove if the 95% credible interval for a random SD or correlation includes zero)

Prior - Summary

  • Setting priors can be tricky (slide after next has useful links)
  • When estimating parameters the main thing is to not be too restrictive (you want to let the data 'speak' and not rule out certain values without good reason)
  • For complex models (e.g., those with transformations of parameters), assessing whether the prior actually expresses what we expect can be difficult
  • It can be useful to look at the prior predictive distribution, which is essentially many simulated data sets using parameters drawn from the prior (extra slides on this at the end)
  • Typically, we will have enough data to 'overwhelm' the prior
  • If you are worried that you do not, you can do sensitivity analyses (i.e., check how much your conclusions depend on the prior)

Prior - Summary

  • For model comparison and particularly hypothesis testing with Bayes factors priors should be selected much more carefully
  • The different models you are comparing are defined by their priors
  • So they should reflect reasonable alternatives
  • For example, comparing a model with a parameter (e.g., mean difference) to one without (null), if the prior on the parameter is very diffuse (e.g., a normal with large SD) you will likely get strong evidence for the null

Useful papers/links on priors

MCMC Sampling

Why sample?

  • The posterior, \(p(\theta \mid y)\), is a distribution but the shape of that distribution is not always directly attainable (i.e., no analytic expression)
  • In these situations sampling is needed to approximate the posterior distribution
  • This is what is offered by software like JAGS, BUGS, Stan etc

MCMC

Markov Chain Monte Carlo

  • Markov Chain - a "memoryless" chain of events. Each step depends on the current state (e.g. parameter value) and not previous ones
  • Monte Carlo - Repeated random sampling

Goal of MCMC - to approximate a target distribution

MCMC

  • The mcmc slides introduce the basics of MCMC in Bayesian analyses via a simple Metropolis-Hastings algorithm (see metropolis.example.R)
  • The important points are:
    • MCMC generates a chain of samples
    • Once the chain has converged on a stable distribution, parameter values are visited in proportion to their posterior probability/density
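A minimal Metropolis sampler for \(\mu\) (with \(\sigma\) fixed at 10), in the spirit of metropolis.example.R; the data and tuning constants here are made up:

```r
# Metropolis sampling of p(mu | y) for the normal reading-time example
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 10)          # stand-in reading times

log_post <- function(mu)                    # log prior + log likelihood
  dnorm(mu, 40, 10, log = TRUE) + sum(dnorm(Y, mu, 10, log = TRUE))

n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 0                               # deliberately poor start
for (i in 2:n_iter) {
  proposal <- rnorm(1, chain[i - 1], sd = 2)        # symmetric proposal
  if (log(runif(1)) < log_post(proposal) - log_post(chain[i - 1])) {
    chain[i] <- proposal                    # accept
  } else {
    chain[i] <- chain[i - 1]                # reject: repeat current value
  }
}
mean(chain[-(1:1000)])                      # posterior mean after burn in
```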

How accurate is the MCMC chain?

Things to consider

  1. Burn in (or warm up)
  2. Auto-correlation
  3. Effective Sample Size (ESS)
  4. Thinning
  5. Convergence and the Potential Scale Reduction Factor (PSRF)

We'll talk about 1 and 5. Slides on 2,3,4 at the end

Burn in (or warm up)

Note: Warm up for brms/Stan is more complicated and serves to tune the sampling parameters

Convergence

How do we know that we have converged onto a stable distribution?

Convergence

  • Run multiple sequences (chains) with different starting points
  • Compare variation between different chains to variation within chains. When these are approximately equal we can claim to have converged on the target distribution
  • This is measured via \(\hat{R}\). A value of 1 means equal variation between and within chains
  • Conventionally \(\hat{R} < 1.1\) is considered converged
  • \(\hat{R}\) is also referred to as the Gelman-Rubin convergence diagnostic or the Potential Scale Reduction Factor (PSRF)
  • The \(\hat{R}\) calculated by brms/Stan is slightly different (also compares the beginning and end of the same chain; see Gelman et al., 2014, pages 284-286)
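A base-R sketch of the classic (non-split) \(\hat{R}\) computation, to make the between/within comparison concrete (the chains below are simulated, not real MCMC output):

```r
# Basic potential scale reduction factor for an iterations x chains matrix
rhat <- function(chains) {
  n <- nrow(chains); m <- ncol(chains)
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)                 # between-chain variance
  W <- mean(apply(chains, 2, var))          # within-chain variance
  var_plus <- (n - 1) / n * W + B / n       # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(1)
converged <- matrix(rnorm(4000), ncol = 4)             # 4 chains, same target
stuck <- cbind(rnorm(1000), rnorm(1000, mean = 5))     # chains disagree
rhat(converged)   # close to 1
rhat(stuck)       # far above the 1.1 cutoff
```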

Convergence

Once the model is fit

Posterior Predictive Checking

Posterior Predictive Checking

  • Posterior predictive checks are a way of evaluating the performance of a particular model and identifying potential areas of misfit
  • Involves simulating outcomes (\(y_{\mbox{rep}}\)) from the model using each step in the MCMC chain as the parameter settings
  • Thus, it incorporates the uncertainty in the estimated parameters

Posterior Predictive Checking

For each step in the chain (or a random subset of the chain), simulate \(N\) observations from the model with parameters set to the current step in the chain (where \(N\) is the size of the original data set)

We can compare these simulated outcomes to the observed data
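A sketch of this in base R, with made-up draws standing in for a real MCMC chain:

```r
# Posterior predictive simulation for the normal model: one simulated
# data set per (mu, sigma) draw
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 10)                    # observed data (fake)
draws <- data.frame(mu = rnorm(1000, mean(Y), 2),     # stand-in posterior
                    sigma = abs(rnorm(1000, sd(Y), 1.5)))

y_rep <- sapply(seq_len(nrow(draws)), function(s)
  rnorm(length(Y), draws$mu[s], draws$sigma[s]))      # N x draws matrix

rep_means <- colMeans(y_rep)             # T = mean for each simulated set
mean(rep_means > mean(Y))                # posterior predictive p for the mean
```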

Posterior Predictive Checking

Posterior Predictive Checking

  • We can also calculate specific quantities (e.g., max, min, mean) for each posterior predictive sample and compare to those from the observed data set
  • If a particular statistic (\(T\)) from the observed data is rare under the model predictions, this is indicative of misfit and potential for model improvement

Posterior Predictive Checking

Posterior Predictive Checking

Model Comparison

Model Comparison

  • Out-of-sample prediction accuracy
    • Approximate leave-one-out (LOO) cross validation
    • Widely Applicable Information Criterion (WAIC)
  • Marginal likelihood
    • Bayes factor (BF)
    • Posterior model probability

LOO and WAIC

Both try to estimate the expected log predictive density (elpd) for new data

  • LOO can be approximated by importance sampling; specifically we will use Pareto smoothed importance sampling (PSIS) as implemented in the loo package (see Vehtari et al., 2017 for details)
  • WAIC is a Bayesian extension of AIC (Watanabe, 2010). It essentially estimates the effective number of parameters of the model and uses this to penalize its predictive accuracy

Larger elpd is better but note that LOO and WAIC are often reported on deviance scale (multiplied by -2), in which case smaller values indicate better fit.

LOO and WAIC

  • Asymptotically, WAIC and LOO should be the same, although LOO is advocated for more strongly where possible (see Vehtari et al., 2017)
  • When not possible (as we might see in some brms examples), \(K\)-fold cross validation can be used

Bayes factors

\[ \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(y \mid M_1)}{p(y \mid M_2)} \times \frac{p(M_1)}{p(M_2)} \]

\[ \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = BF_{1,2} \times \frac{p(M_1)}{p(M_2)} \]

  • The Bayes factor, \(BF\), is the 'updating factor'
  • How much does our belief in model 1 over model 2 change, having seen the data?

Marginal Likelihood

\[ p(y \mid M) = \int p(y \mid \theta, M) p(\theta \mid M) d\theta \]

  • 😱
  • We can use bridge sampling to estimate the marginal likelihood (see Gronau et al., 2017 for an introduction)
  • There are other approaches such as 'transdimensional MCMC' or the JZS 'default' Bayes factors implemented in the BayesFactor package (only normal models)
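To make the integral concrete, a naive Monte Carlo estimate simply averages the likelihood over draws from the prior (made-up data below; this estimator is very inefficient in general, which is exactly why bridge sampling exists):

```r
# Naive Monte Carlo estimate of the marginal likelihood p(y | M) for the
# normal reading-time example
set.seed(1)
Y <- rnorm(30, mean = 42, sd = 10)              # stand-in reading times
mu_draws <- rnorm(5000, 40, 10)                 # draws from the priors
sigma_draws <- abs(rcauchy(5000, 0, 10))

lik <- mapply(function(m, s) exp(sum(dnorm(Y, m, s, log = TRUE))),
              mu_draws, sigma_draws)            # likelihood at each draw
mean(lik)                                       # estimate of p(y | M)
```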

Summarizing the Posterior

Credible interval and highest density interval

You will see both of these around…

  • 95% Credible Interval (CI): the 2.5% to 97.5% quantiles (output by default in brms)
  • 95% Highest Density Interval (HDI): an interval containing 95% of the posterior mass such that values contained within the interval have higher posterior density than values outside the interval (can use HDInterval::hdi() to calculate)
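A minimal HDI computation from posterior samples (the same idea as HDInterval::hdi): slide a window over the sorted draws and keep the narrowest interval covering 95%:

```r
# HDI from samples: narrowest interval containing 'prob' of the draws
hdi <- function(samples, prob = 0.95) {
  s <- sort(samples)
  n_keep <- ceiling(prob * length(s))
  starts <- seq_len(length(s) - n_keep + 1)
  widths <- s[starts + n_keep - 1] - s[starts]
  i <- which.min(widths)                   # narrowest window wins
  c(lower = s[i], upper = s[i + n_keep - 1])
}

set.seed(1)
post <- rexp(1e4)              # skewed posterior: HDI and quantile CI differ
hdi(post)                      # hugs zero
quantile(post, c(0.025, 0.975))  # does not
```

For a symmetric posterior the two intervals coincide; the more skewed the posterior, the more they differ.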

Credible interval and highest density interval

Credible interval and highest density interval

Assessing null values with one model fit

ROPE

Region Of Practical Equivalence

  • Requires setting a boundary around some value (e.g., zero)
  • Values within the boundary are considered "practically equivalent" to the chosen value
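In practice this reduces to the share of posterior draws inside the region; the draws and boundaries below are hypothetical:

```r
# Proportion of posterior draws inside a ROPE around zero
set.seed(1)
post_b1 <- rnorm(1e4, mean = 0.3, sd = 0.1)   # stand-in posterior for a slope
rope <- c(-0.1, 0.1)                          # hypothetical equivalence region

p_in_rope <- mean(post_b1 > rope[1] & post_b1 < rope[2])
p_in_rope    # small: little posterior mass is practically equivalent to 0
```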

ROPE

The Savage-Dickey Bayes factor

  • AKA the Savage-Dickey density ratio (see Wagenmakers et al., 2010 for an introduction)
  • For a point hypothesis regarding a parameter value, we compare the height of the posterior distribution to the height of the prior distribution at that particular value
  • The relative height of the posterior and prior tells us how much our belief in that particular value has changed after seeing the data
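A sketch of the ratio using a kernel density estimate of the posterior (the draws and the Normal(0, 1) prior are made up for illustration):

```r
# Savage-Dickey ratio at b1 = 0: posterior height / prior height
set.seed(1)
post_b1 <- rnorm(1e4, mean = 0.5, sd = 0.2)   # stand-in posterior draws

d <- density(post_b1)                          # kernel density estimate
post_at_0 <- approx(d$x, d$y, xout = 0)$y      # posterior height at 0
prior_at_0 <- dnorm(0, mean = 0, sd = 1)       # prior height at 0

bf01 <- post_at_0 / prior_at_0   # < 1: belief in the null has decreased
bf01
```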

The Savage-Dickey Bayes factor

The Savage-Dickey Bayes factor

Summary

Steps we'll follow in our brms examples:

  1. Figure out what model is appropriate for the data at hand
  2. Specify reasonable priors for model parameters (need to be especially careful if you want Bayes factors)
  3. Fit model (using brms, Stan, etc) and ensure chains have converged (we'll cover other possible warnings/errors specific to Stan)
  4. Posterior predictive plots - are there areas for improvement?
  5. Refine model, compare competing models, …
  6. Examine posterior quantities (mean, median, CI, HDI)

End of introduction to Bayesian analysis

Extra slides

Prior Predictive Distribution

  • When setting priors for complex models it can be useful to look at the distribution of data implied by the prior distribution
  • We can simulate data from draws from the prior distribution
  • Do these predictions fall within a reasonable range? Domain expertise needed here

The next slides show examples of simulated data for the linear regression example (\(y_i \sim \mbox{Normal}(\beta_0 + \beta_1x_i, \sigma)\)) - with 100 observations of x=0 and 100 of x=1
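The simulation itself can be sketched in a few lines of base R (each column is one data set implied by one draw from the priors):

```r
# Prior predictive simulation for the regression example:
# 100 observations at x = 0 and 100 at x = 1
set.seed(1)
x <- rep(c(0, 1), each = 100)
one_prior_draw <- function() {
  b0 <- rnorm(1, 0, 3); b1 <- rnorm(1, 0, 3)   # priors from earlier slides
  sigma <- abs(rcauchy(1, 0, 2.5))             # half-Cauchy(2.5)
  rnorm(length(x), b0 + b1 * x, sigma)         # one simulated data set
}
sims <- replicate(200, one_prior_draw())       # obs x draws matrix

range(colMeans(sims))   # do the implied outcome means look reasonable?
```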

Prior Predictive Distribution

Prior Predictive Distribution

Prior Predictive Distribution

Prior Predictive Distribution

  • Do these plots look reasonable? Note we could (and should) have looked at other quantities of the simulated data (e.g., min, max, condition diffs)
  • Do they strike the balance of not putting too much prior weight on unlikely (in your expert opinion) outcomes while not being overly restrictive?
  • For psychologists, we'll usually have enough data to 'overwhelm' mildly informative priors

More on sampling

Autocorrelation

  • Sometimes the sampler does not explore the parameter space effectively
  • Below the samples are very autocorrelated (not independent)

Autocorrelation

  • How does correlation between points in the chain change with different lags between?
  • Left panel is bad, right panel is good

Effective Sample Size (ESS)

A way of estimating the number of independent samples once accounting for autocorrelation:

\[ ESS = \frac{N}{1 + 2\sum_{k = 1}^{\infty} \rho_k} \]

Where \(\rho_k\) is the auto-correlation at lag \(k\). Think of this as dividing the number of samples by the amount of auto-correlation. In practice the sum stops when the auto-correlation is small (e.g. \(\rho_k < 0.05\); see Kruschke, 2015, p. 184).
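A direct implementation of this formula using R's acf() (the chains below are simulated, not real MCMC output):

```r
# ESS: divide N by 1 + 2 * (sum of autocorrelations), stopping the sum
# once the autocorrelation falls below 0.05 (per Kruschke, 2015)
ess <- function(chain, cutoff = 0.05, max_lag = 100) {
  rho <- acf(chain, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  k <- which(rho < cutoff)[1]              # first 'small' autocorrelation
  if (is.na(k)) k <- length(rho) + 1
  length(chain) / (1 + 2 * sum(rho[seq_len(k - 1)]))
}

set.seed(1)
iid <- rnorm(2000)                                     # independent draws
ar <- as.numeric(arima.sim(list(ar = 0.9), n = 2000))  # sticky chain
ess(iid)   # close to 2000
ess(ar)    # far fewer effective samples
```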

Thinning

Keeping only every \(k^{\mbox{th}}\) sample (discarding the rest) can reduce autocorrelation (below, \(k=10\))
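A sketch of thinning an autocorrelated chain (simulated here as an AR(1) series):

```r
# Keep every 10th sample of a highly autocorrelated chain
set.seed(1)
chain <- as.numeric(arima.sim(list(ar = 0.9), n = 10000))
thinned <- chain[seq(1, length(chain), by = 10)]

length(thinned)                        # 1000 samples remain
acf(thinned, plot = FALSE)$acf[2]      # lag-1 autocorrelation is much reduced
```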


Thinning

  • If auto-correlation is really bad thinning might help, but it might suggest deeper problems with your model (see Gelman et al., 2014)
  • It has been claimed that thinning is "often unnecessary and always inefficient" (Link & Eaton, 2012)
  • Often it is better to keep the full chains